Prediction of spin–spin coupling constants with machine learning in NMR

Abstract Nuclear magnetic resonance (NMR) spectroscopy is one of the most important methods for analyzing the molecular structures of compounds. The objective in this study is to predict indirect spin–spin coupling constants in NMR based on machine learning. We propose important descriptors for predicting indirect spin–spin coupling constants from target pairs of atoms in molecules, and combine the proposed descriptors with molecular descriptors to predict indirect spin–spin coupling constants with LightGBM as a regression analysis method. We construct regression models using a dataset and verify their predictive accuracy, and then confirm that the proposed descriptors can predict indirect spin–spin coupling constants more accurately than the traditional descriptors used to predict chemical shifts.


INTRODUCTION
Nuclear magnetic resonance (NMR) spectroscopy is one of the most important methods for analyzing the molecular structures of compounds. NMR is a spectroscopic technique in which a sample in a magnetic field is irradiated with constant radio waves to observe its spectrum, enabling structural analysis and quantitative measurements of compounds. 1,2 Most NMR experiments are performed with radio frequency pulses and Fourier transform. Because it provides useful information on intermolecular and intramolecular interactions and molecular dynamics, in addition to molecular structures, NMR is used in a wide range of fields such as chemistry, materials science, medicine, and life sciences.
NMR measurements provide information on chemical shifts, signal intensity, and indirect spin-spin coupling constants; chemical shifts are one of the most important components in the identification of molecular structures using NMR. The signal strength can be calculated from the intensity of the NMR spectra, and provides information on composition ratios and mixing ratios. From indirect spin-spin coupling constants, information on the relative bond distances and angles within molecules can be extracted using relationships between the coupling constants and geometrical parameters; some of these relationships are found empirically, and others can be provided by theoretical quantum chemistry. This information is then useful in determining the bonds between atoms and the dihedral angles of molecules in stereochemistry. Scalar coupling, or J coupling, is an indirect interaction between the nuclear spins of two atoms in a magnetic field. The number that comes before the J in the J coupling types (1J, 2J, 3J) denotes the number of bonds between the coupled atoms. For example, 1 J CH is the spin-spin coupling constant between a hydrogen atom and a carbon atom separated by one bond (that is, directly bonded).
To investigate whether the chemical structure of a synthesized compound is consistent with the target structure, NMR results have been predicted from compound structures. 3 The above three pieces of information obtained from NMR can be predicted by ab initio quantum chemistry methods, of which density functional theory (DFT) is by far the most computationally efficient. The objective of this study is to construct regression models that predict indirect spin-spin coupling constants while considering the whole chemical structures of molecules. We propose descriptors containing important information on indirect spin-spin coupling constants from three-dimensional chemical structures, and then construct regression models between the proposed descriptors and the indirect spin-spin coupling constants. The regression models can be used to predict indirect spin-spin coupling constants for new chemical structures. In this study, we demonstrate the effectiveness of the proposed method using a Kaggle dataset. 13

MATERIALS AND METHODS
We use the first version of the champs-scalar-coupling dataset 13 downloaded from Kaggle, 14 which consists of three-dimensional chemical structures and their indirect spin-spin coupling constants.
In this study, we construct regression models of the form y = f(X), in which y is the spin-spin coupling constant between a target pair of two atoms and X is the set of descriptors that provide numerical representations of the target pair of two atoms. Traditional descriptors used to predict chemical shifts from three-dimensional chemical structures 7 did not consider distances between atoms. Similarity between chemical environments of each molecular structure 11 can only be combined with regression analysis methods based on the kernel method. The proposed descriptors X for a pair of two atoms are listed in Table 1.
The molecular descriptors computed in RDKit (version 2019.09.1), 15,16 which is an open-source library used in the field of chemoinformatics, are also used in this study, as they were used in the prediction of chemical shifts with machine learning; 7 examples of these descriptors include the number of bonds of the target atom, the presence or absence of rings, orbital hybridization, aromaticity, the number of atoms in a molecule, the number of bonds, and the molecular weight. Because both target atoms are hydrogen atoms in the 2 J HH and 3 J HH types of scalar coupling, the descriptors based on the distances between target atoms are the same, and so the distances between target atoms and oxygen/nitrogen atoms are deleted.
In their place, the average distance between the target hydrogen atom and its nearest carbon or nitrogen atom, and the averages of the calculated mean, minimum, maximum, and standard deviation, are added as descriptors. In this study, we did not select descriptors based on the correlation between descriptors or on descriptor importance. As the regression analysis method, we use LightGBM, 17 a method based on decision trees and ensemble learning that has high predictive ability, through its Python library 18 (version 2.3.0). The LightGBM parameters are presented in Table 2.

RESULTS AND DISCUSSION
In this study, the data were randomly divided into a training set containing 59 502 molecules (70%) and a test set containing 25 501 molecules (30%). Since the numbers of both training samples and test samples were high and the molecules were randomly divided, diverse chemical structures were included in both training samples and test samples.
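The molecule-level random split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_molecules`, the seed, and the placeholder molecule IDs are assumptions introduced here. Splitting by molecule rather than by atom pair keeps all pairs of one molecule in the same set.

```python
import random

def split_molecules(molecule_ids, train_fraction=0.7, seed=0):
    """Randomly split unique molecule IDs into training and test sets."""
    unique_ids = sorted(set(molecule_ids))   # deterministic starting order
    rng = random.Random(seed)                # seed is an arbitrary assumption
    rng.shuffle(unique_ids)
    n_train = int(train_fraction * len(unique_ids))
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])

# Hypothetical IDs standing in for the 85003 molecules of the dataset.
ids = [f"mol_{i:06d}" for i in range(85003)]
train_ids, test_ids = split_molecules(ids)
print(len(train_ids), len(test_ids))  # 59502 25501
```

Each atom pair would then be assigned to the training or test set according to the molecule it belongs to.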
Regression models were constructed from the training data for each type of scalar coupling, and y-values were predicted for the test data.
RDKit descriptors for which 95% of the values were the same were removed.
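One way to implement this filter is sketched below. The interpretation is an assumption: a descriptor is dropped when its single most frequent value accounts for at least 95% of the samples. The function names and the toy data are hypothetical.

```python
from collections import Counter

def near_constant_columns(rows, threshold=0.95):
    """Indices of columns whose most frequent value covers >= threshold of rows."""
    n_rows = len(rows)
    drop = []
    for j in range(len(rows[0])):
        counts = Counter(row[j] for row in rows)
        top_share = counts.most_common(1)[0][1] / n_rows
        if top_share >= threshold:   # assumed reading of "95% same"
            drop.append(j)
    return drop

def drop_columns(rows, drop):
    """Return a copy of rows without the listed column indices."""
    drop_set = set(drop)
    return [[v for j, v in enumerate(row) if j not in drop_set] for row in rows]

# Toy descriptor matrix: 20 samples, 3 descriptors.
rows = [[i, 0, i % 2] for i in range(20)]
rows[0][1] = 1   # descriptor 1 now equals 0 in 19 of 20 samples (95%)
drop = near_constant_columns(rows)
filtered = drop_columns(rows, drop)
print(drop)  # [1]
```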
For the 1 J NH , 2 J NH , and 3 J NH types of scalar coupling, the distance to the nitrogen atom closest to the target atom is zero, and so this descriptor was removed.
Using LightGBM, the proposed method was compared with the traditional method, in which the descriptors are the RDKit descriptors used for chemical shift prediction 7 and the Euclidean distance between the target atoms. The prediction results for the test data for each type of scalar coupling are presented in Table 3, where r 2 is the coefficient of determination (higher values denote better predictive accuracy), RMSE is the root-mean-squared error (lower values denote better predictive accuracy), and RRMSE is the relative RMSE, in which the RMSE is divided by the standard deviation of each y. As shown in Table 3, for all types of scalar coupling, the proposed method gives higher r 2 and lower RMSE values than the traditional method, indicating the improved predictive accuracy of the proposed method. In particular, the RMSE of 2 J CH reaches ∼30%, with significantly reduced prediction errors. By adding the distances from oxygen and nitrogen atoms as descriptors, we can handle the contribution of oxygen and nitrogen atoms, which are more susceptible to polarity than hydrogen atoms, and the environment around carbon atoms, which bind with many types of atoms. The results confirm that the proposed descriptors work effectively.
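The three evaluation metrics can be computed as in the following sketch. The function name and the toy values are hypothetical, and the use of the population standard deviation of y in the RRMSE is an assumption, since the text does not state which estimator was used.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return r2, RMSE, and RRMSE (RMSE divided by the std of y_true)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    rmse = math.sqrt(ss_res / n)
    std_y = math.sqrt(ss_tot / n)   # population std (assumption)
    return {"r2": 1 - ss_res / ss_tot, "rmse": rmse, "rrmse": rmse / std_y}

# Toy values, not taken from the dataset.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```

With this choice of standard deviation, RRMSE squared equals 1 - r 2, so the two metrics carry the same information on a single test set; RRMSE is mainly convenient for comparing coupling types whose y-values differ in scale.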
Figure 1 shows the plots of actual y-values versus predicted y-values in the test data for each type of scalar coupling. For all types of scalar coupling, the samples given by the proposed method are closer to the diagonal than those given by the traditional method, indicating that the proposed method predicts the indirect spin-spin coupling constants with high accuracy. This confirms that the proposed descriptors can effectively predict the indirect spin-spin coupling constants.
For each type of scalar coupling, descriptors with the highest variable importance in LightGBM are listed in Table 4. For all types of scalar coupling, the descriptors related to the distance to the oxygen or nitrogen atom are in the top three, suggesting that it is useful to consider the environment from the target atom to the closest oxygen or nitrogen atom. For example, in 2 J CH , the distance between the target hydrogen atom and its nearest carbon or nitrogen atom, the distance between the midpoint of the target atoms and its nearest oxygen atom, and the distance between the target hydrogen atom and its nearest oxygen atom are the top three descriptors, indicating that information on the oxygen and nitrogen atoms around the target atom is important. In addition, in most cases, the atom having scalar coupling with the hydrogen atom is directly bonded to the hydrogen atom, and because this information is based on the distance between the hydrogen atom and the directly bonded atom, the distance from the atom having scalar coupling with the hydrogen atom to the target hydrogen atom is considered to be the most important descriptor. However, the fact that some of the samples in the plots in Figure 1 are off the diagonal for each type of scalar coupling confirms that the environment around the atoms is not described sufficiently well.
FIGURE 1 Actual y-values vs. predicted y-values with the traditional method and the proposed method for test data for eight types of scalar coupling: 1 J NH , 1 J CH , 2 J HH , 2 J NH , 2 J CH , 3 J HH , 3 J CH , and 3 J NH

CONCLUSIONS
In this study, we proposed atom-pair descriptors relating the distance between atoms, orbital hybridization, and oxygen and nitrogen atoms around bonds for three-dimensional molecular structures as a means of predicting the indirect spin-spin coupling constants between atoms.
Euclidean distance between the midpoint of the target atoms and its nearest oxygen atom
3 Euclidean distance between the midpoint of the target atoms and its nearest nitrogen atom
3 J NH
1 Euclidean distance between the target hydrogen atom and its nearest carbon or nitrogen atom
2 Euclidean distance between the midpoint of the target atoms and its nearest oxygen atom
3 Euclidean distance between the target atoms
Regression models were constructed to predict indirect spin-spin coupling constants using the proposed descriptors and molecular descriptors as X-variables with LightGBM. The proposed method was found to be more accurate than the traditional method used in predicting chemical shifts. Furthermore, the importance of the proposed descriptors was confirmed by calculating the most influential descriptors using LightGBM. Because the proposed method predicts indirect spin-spin coupling constants quickly, the predicted values could be used as molecular descriptors in virtual screening. However, a limitation of the proposed method is its predictive accuracy compared with DFT calculations. In future work, the accuracy of the indirect spin-spin coupling constant predictions will be improved by adding descriptors that can deal with atoms other than oxygen and nitrogen atoms in the vicinity of the bond.
NMR measurements provide information on chemical shifts, signal intensity, and indirect spin-spin coupling constants. Chemical shifts refer to the resonance frequencies of specific nuclear spins within molecules, and their values are determined by the environment around the nucleus. These are one of the most important components in the identification of molecular structures using NMR.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. Analytical Science Advances published by Wiley-VCH GmbH
The dataset provides three-dimensional chemical structures and indirect spin-spin coupling constants for 85003 molecules. The molecules contain only the atoms carbon (C), hydrogen (H), nitrogen (N), fluorine (F), and oxygen (O). Because indirect spin-spin coupling constants exist for pairs of atoms in a molecule, there are 4 658 147 sets of spin-spin coupling constants and target atom pairs in the current dataset. The dataset contains data for eight types of scalar coupling: 1 J NH , 1 J CH , 2 J HH , 2 J NH , 2 J CH , 3 J HH , 3 J CH , and 3 J NH , and the numbers of samples are 43363, 709416, 378036, 119253, 1140674, 590611, 1510379, and 166415, respectively.
Bond angles and dihedral angles would be considered in the molecular descriptors provided by RDKit. The descriptors derived from the Euclidean distance between atoms with the same type of scalar coupling in a molecule include the mean, maximum, minimum, and standard deviation, and the difference and quotient of their values. The Euclidean distances between the midpoint of a target atom pair and the closest oxygen/nitrogen atom in a molecule are taken as distance descriptors. When there are no oxygen atoms in a molecule, the distance is set to 1000. The same is true in the absence of nitrogen atoms.
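The distance descriptors described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, the atom representation (element symbol plus coordinates), and the toy geometry are all hypothetical. The difference and quotient descriptors are omitted because the text does not specify which quantities they compare, and the population standard deviation is an assumed choice.

```python
import math
import statistics

MISSING_DISTANCE = 1000.0  # value the text assigns when no O (or N) atom is present

def nearest_distance(point, atoms, element):
    """Distance from point to the nearest atom of the given element."""
    dists = [math.dist(point, xyz) for symbol, xyz in atoms if symbol == element]
    return min(dists) if dists else MISSING_DISTANCE

def pair_descriptors(atoms, i, j):
    """Distance descriptors for the target atom pair (i, j)."""
    a, b = atoms[i][1], atoms[j][1]
    midpoint = tuple((p + q) / 2 for p, q in zip(a, b))
    return {
        "dist_pair": math.dist(a, b),
        "dist_mid_nearest_O": nearest_distance(midpoint, atoms, "O"),
        "dist_mid_nearest_N": nearest_distance(midpoint, atoms, "N"),
    }

def distance_statistics(distances):
    """Summary statistics over pair distances of one coupling type in a molecule."""
    return {
        "mean": statistics.fmean(distances),
        "max": max(distances),
        "min": min(distances),
        "std": statistics.pstdev(distances),  # population std (assumption)
    }

# Toy geometry with made-up coordinates; it contains no nitrogen atom.
atoms = [("C", (0.0, 0.0, 0.0)), ("H", (1.0, 0.0, 0.0)), ("O", (0.0, 2.0, 0.0))]
desc = pair_descriptors(atoms, 0, 1)
stats = distance_statistics([1.0, 2.0, 3.0])
print(desc, stats)
```

The sketch does not exclude the target atoms themselves from the nearest-atom search, so a target nitrogen atom would be found at distance zero from itself when the search starts from that atom.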

TABLE 2 Parameters
TABLE 3 Prediction results for the test data. The traditional method means LightGBM models with traditional descriptors
TABLE 4 Descriptors with the highest variable importance in LightGBM for eight types of scalar coupling: 1 J NH , 1 J CH , 2 J HH , 2 J NH , 2 J CH , 3 J HH , 3 J CH , and 3 J NH
[4][5][6] However, the DFT cal-